Author Compensation represents the largest cash outflow for the Skills Organization at Company X (PS). This outflow keeps the Content Library fresh and stocked with relevant, interesting, and applicable content for learners. To balance author sentiment, market competitiveness, and cost savings to PS, great care has gone into the methodology used to assign authors payment figures in their Scope of Work agreements.
Over the years the framework has adapted and evolved to improve model accuracy and leverage the data available at PS. The overarching objective behind these improvements has been, and always will be, to fairly and accurately compensate Authors at Company X.
Two aspects capture the sentiment behind this objective: compensating authors fairly, and realizing cost savings for PS. Accurately projecting a course's performance in the Content Library can fulfill both of these OKRs by reducing excessive under- and overpayment. This walkthrough explores the data, features, algorithms, improvements, and outcomes of the predictive modeling workflow used to fulfill these objectives.
“Course Performance” is difficult to define at a scaled level, as niche content topics are not ingested as much as wider-scale technical learning areas. These niche content groups are just as important, as they allow PS to scale its offerings to new customer groups and remain adaptive in a dynamic tech world.
View time for a given piece of video content is tracked at PS on a daily basis, offering an opportunity to measure predictive outcomes in terms of a course's view time. To scale the outcome across the content library, the percentage of total view time that a course accounts for is used rather than raw view time. This ensures that niche content groups aren't penalized for low viewership and can still receive fair compensation.
Aggregating the available data at the course level on a month-to-month basis gives a monthly view time percentage for a given piece of content. This is used as the outcome variable for our predictive modeling. If view time percentage over the lifespan of a course can be accurately projected, then appropriate compensation offers can be assigned that should, in theory, bring the author to their compensation target.
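As a minimal sketch (with hypothetical column names, not the production schema), the monthly outcome variable can be derived as follows:

```r
# Illustrative sketch: daily view time rolled up to a monthly share of
# total library view time. Column names are hypothetical examples.
daily <- data.frame(
  course_name = c("a", "a", "b", "b"),
  year_month  = c("2023-01", "2023-01", "2023-01", "2023-01"),
  view_time   = c(30, 20, 100, 50)
)

# Total view time per course per month
by_course <- aggregate(view_time ~ course_name + year_month, daily, sum)

# Total library view time per month
by_month <- aggregate(view_time ~ year_month, daily, sum)
names(by_month)[2] <- "library_view_time"

# Share of the library's monthly view time attributable to each course
vt <- merge(by_course, by_month, by = "year_month")
vt$view_time_perc <- vt$view_time / vt$library_view_time
```

The shares for a given month sum to 1 across the library, which is what lets niche and high-volume content be compared on the same scale.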
The equation used to determine the appropriate "Royalty Rate" (the share of attributable course revenue paid to the author) is as follows:
\[
\text{Royalty Rate} =
\frac{CompTarget}{\sum_{i=1}^{24}\left(\widehat{VT\%}_i \times Revenue_i\right)}
\]
where \(i\) indexes a given month over a 24-month period and \(\widehat{VT\%}\) is the projected view time percentage. In line with the cost savings objective detailed previously, if the raw Royalty Rate falls outside the bounds of 6-15%, it is clamped to that range. This ensures excessively large or small royalty rates are not assigned, which would erode cost savings for PS and, in some cases, compensate authors unfairly. The range was determined through previous models and Author Compensation framework iterations, and is subject to change should the model become accurate enough that misprojections no longer risk undesirably large payouts. One goal of widening the range would be to bring niche content authors closer to their target figures. In addition, the range is dynamically scaled to match the financial growth of Company X, ensuring that as revenues grow, payments to authors remain sustainable.
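A minimal sketch of the rate calculation and clamping described above, with illustrative inputs:

```r
# Sketch of the Royalty Rate calculation with the 6-15% bounds applied.
# Inputs are illustrative; in production the projected view time share
# and monthly revenue come from the model and finance systems.
royalty_rate <- function(comp_target, vt_hat, revenue,
                         lower = 0.06, upper = 0.15) {
  stopifnot(length(vt_hat) == length(revenue))  # one value per month
  raw <- comp_target / sum(vt_hat * revenue)    # sum over 24 months
  min(max(raw, lower), upper)                   # clamp into [6%, 15%]
}

# 24 months of projected view-time share and monthly revenue
vt_hat  <- rep(0.001, 24)
revenue <- rep(1e6, 24)
royalty_rate(20000, vt_hat, revenue)  # raw rate ~0.83, clamped to 0.15
```

The clamp is what implements the 6-15% policy bound; the raw ratio alone would assign the out-of-range rate directly.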
Now that the outcome variable and the methods used to obtain the Royalty Rate from that outcome have been described, the data and features that are used to train the predictive model can be explored.
The Author Compensation framework presents a unique Machine Learning (ML) problem, as little information about a course's content or indicators of its performance is available at inception. Typically a Project Coordinator (PC) obtains basic information from the Author, such as the course's duration, some of its key content ideas, and its location (if any) within a PS learning path. The limited number of available features makes it difficult to build a feature set large enough to train on confidently, but thanks to large-scale developments by the Skills Organization, new features and modeling approaches have been identified that drive drastic increases in projection accuracy.
Thanks to the VTO team and the wider Skills organization, the integration of the new Market Taxonomy tagging system brings huge advances in the predictive capability of the view time model. Under the previous curriculum taxonomy, the tags assigned to courses to describe their content and domain were far less informative to the model's performance. Simply switching the feature set from the curriculum taxonomy to the Market Taxonomy produced an overall performance improvement of 38% (measured by Mean Absolute Error, or MAE). An improvement of this magnitude is significant to say the least, and offers even more insight into which content groups perform differently than others.
Features used from the new Market Taxonomy are limited to what are known as Level 1 and Level 2 tags. Each content item is given a set of tags spanning Levels 1-3, and multiple tag sets can be assigned to one piece of content. With each increasing level, the granularity of the content topic increases as well. Level 3 tags have the highest specificity and as a result have been excluded from the features provided to the model. This is done to ensure the model does not overfit. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it hurts performance on new data, and highly specific Level 3 tags are a prime example of a feature that could introduce that risk.
Content pieces assigned multiple sets of unique tags present an issue for feature selection, as one set of tags may be more informative than another. In lieu of developing a method to select the optimal set of tags, all unique tags across sets were aggregated for each course and used to train the algorithm.
If a given piece of content was assigned two tag sets containing the same Level 1 tag but dissimilar Level 2 tags, only a single Level 1 tag was used and both unique Level 2 tags were included. This avoids redundancy in the information gain provided by tags while still leveraging the increased informativeness of the Market Taxonomy. Courses included in the analysis had anywhere between 1 and 9 individual sets of tags.
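A small illustration of this deduplication rule, using hypothetical tag values:

```r
# Sketch of collapsing multiple tag sets into one set of unique tags
# per course. Tag values here are hypothetical examples.
tag_sets <- data.frame(
  course_name = c("course_1", "course_1"),
  lvl1 = c("Software Development", "Software Development"),
  lvl2 = c("Web Development", "Databases")
)

# Union of tags across sets: the duplicated Level 1 tag collapses to
# one entry, while both dissimilar Level 2 tags are kept
unique_lvl1 <- unique(tag_sets$lvl1)
unique_lvl2 <- unique(tag_sets$lvl2)
```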
Additional features provided to the model based on the Market Taxonomy tagging included:
lvl_1 <- model_dataset %>%
select(course_name, n_tags_lvl1) %>%
distinct() %>%
group_by(n_tags_lvl1) %>%
summarise(count_tags = n()) %>%
ggplot() + geom_col(aes(x=reorder(n_tags_lvl1, -count_tags), y = count_tags, fill = count_tags),
width = 0.5, position = position_dodge(0.4)) + theme_classic() +
scale_fill_gradient2(low = ps_pink, mid = ps_purple, high = ps_orange, midpoint = 250) + theme(text = element_text(size = 15), legend.position = 'none') +
ylab('Count') + xlab('Level 1 Tags') + scale_y_continuous(expand = c(0, 0))
lvl_2 <- model_dataset %>%
select(course_name, n_tags_lvl2) %>%
distinct() %>%
group_by(n_tags_lvl2) %>%
summarise(count_tags = n()) %>%
ggplot() + geom_col(aes(x=reorder(n_tags_lvl2, -count_tags), y = count_tags, fill = count_tags),
width = 0.5, position = position_dodge(0.4)) + theme_classic() +
scale_fill_gradient2(low = ps_pink, mid = ps_purple, high = ps_orange, midpoint = 250) + theme(text = element_text(size = 15), legend.position = 'none') +
ylab(NULL) + xlab('Level 2 Tags') + scale_y_continuous(expand = c(0, 0))
grid.arrange(lvl_1, lvl_2, top = textGrob("Number of Tags for Level 1 and 2", gp=gpar(fontsize=18), vjust = 0.4), ncol = 2)
model_dataset %>%
mutate(n_tags_total = n_tags_lvl1 + n_tags_lvl2) %>%
select(n_tags_total, tag_overlap) %>%
distinct() %>%
group_by(n_tags_total, tag_overlap) %>%
summarise(count_n = n()) %>%
ggplot() +
geom_col(aes(x = as.factor(n_tags_total), y = count_n, fill = as.factor(tag_overlap))) +
labs(title = 'Overlap Between Tag Sets by Number of Tags', x = 'Number of Total Tags', y = 'Count') +
scale_fill_manual(values = c(ps_orange, ps_pink), labels = c('No Overlap', 'Overlap'), name = NULL) +
theme_classic() +
scale_y_continuous(expand = c(0,0)) +
theme(text = element_text(size = 15))
Due to the need to scale the algorithm into production and the dynamic nature of tagging new content, a “master tag set” was developed and is used to generate a matrix of one-hot encoded variables for the respective Level 1 and Level 2 tags. This results in a highly dimensional feature set (more than 200 independent variables), but computation time and power are not a great concern here, as resources are available to speed up the training process. The master tag set is used in the production workflow to compare against incoming content and its tags.
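A minimal sketch of the encoding step, assuming a hypothetical master tag set:

```r
# Sketch of one-hot encoding course tags against a fixed "master tag
# set" so incoming content always produces the same columns the model
# was trained on. Tag names are illustrative, not the real taxonomy.
master_tags <- c("lvl1_software", "lvl1_it_ops", "lvl2_web", "lvl2_cloud")

encode_tags <- function(course_tags, master = master_tags) {
  # 1 where the course carries the tag, 0 otherwise; tags outside the
  # master set are dropped, keeping the feature matrix stable over time
  as.integer(master %in% course_tags)
}

encode_tags(c("lvl1_software", "lvl2_cloud"))  # c(1, 0, 0, 1)
```

Fixing the column order via the master set is what keeps training-time and production feature matrices aligned.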
When faced with the previously mentioned issue of multiple tag sets for a single piece of content, another approach identified during exploratory data analysis (EDA) was clustering tag sets based on their view time %. While not used as the final method to address the multiple tag set issue, it was still informative, as it showed which unique tags grouped together (or did not) in relation to their view time %. Using Gower distance, which measures dissimilarity between observations while accommodating both continuous and categorical variables, the available data were clustered and cluster numbers assigned. The resulting algorithm is used in production to assign a course (based on its tags) to an existing cluster. Leveraging statistical information and variance in the dataset, an optimal number of 45 clusters was chosen. These clusters are used as categorical variables within the training dataset, as one cluster is fundamentally different from another.
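A rough sketch of this clustering approach using the `cluster` package (`daisy` for Gower distance, PAM for the clustering itself); the data and the choice of k below are synthetic stand-ins, not the production configuration:

```r
# Sketch of Gower-distance clustering on mixed-type features. The data
# frame is synthetic, and k is fixed here for illustration; the
# production model selects 45 clusters from statistical criteria.
if (requireNamespace("cluster", quietly = TRUE)) {
  set.seed(123)
  toy <- data.frame(
    lvl1_tag = factor(sample(c("software", "it_ops"), 50, replace = TRUE)),
    duration_hours = runif(50, 0.5, 8),
    view_time_perc = runif(50, 0, 0.01)
  )
  # Gower distance handles categorical and continuous columns together
  gower_d <- cluster::daisy(toy, metric = "gower")
  # Partitioning Around Medoids on the dissimilarity matrix
  pam_fit <- cluster::pam(gower_d, diss = TRUE, k = 5)
  toy$cluster <- factor(pam_fit$clustering)
}
```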
These clusters add an identifying layer to the model training set while still avoiding data leakage. Data leakage would occur if view time percentage indicators made their way into the training set through some combination of features, and would reveal itself through large misprojections on the training/holdout data. Clusters also offer insight into the model's fit on distinctly different and similar course groups; a model with consistent accuracy across a large number of clusters builds confidence in its ability to predict across dissimilar groups. Below are the clusters with the highest average view time; as is evident, there is significant differentiation across clusters in view time performance.
model_dataset %>%
group_by(cluster) %>%
summarise(avg_vt = mean(view_time_perc)) %>%
arrange(desc(avg_vt)) %>%
dplyr::slice(1:15) %>%
ggplot() + geom_point(aes(x = reorder(cluster, -avg_vt), y = avg_vt, group = 1), color = ps_orange) + theme_classic() +
geom_line(aes(x = reorder(cluster, -avg_vt), y = avg_vt, group = 1), color = ps_orange) + theme(text = element_text(size = 15),
panel.grid.major = element_line(colour=rgb(235, 235, 235, 100, maxColorValue = 255), size=0.4)) +
xlab('Cluster') + ylab('Average VT %') + labs(title = 'Clusters with the Highest Average View Time %')
The most common clusters are shown below:
model_dataset %>%
select(course_name, cluster) %>%
distinct() %>%
select(-course_name) %>%
group_by(cluster) %>%
summarise(n_clust = n()) %>%
arrange(desc(n_clust)) %>%
dplyr::slice(1:15) %>%
ggplot() + geom_col(aes(x=reorder(cluster, -n_clust), y = n_clust, fill = n_clust),
width = 0.5, position = position_dodge(0.6)) + theme_classic() +
scale_fill_gradient2(low = ps_pink, mid = ps_purple, high = ps_orange, midpoint = 250) + theme(text = element_text(size = 15)) +
theme(legend.position="none") + scale_y_continuous(expand = c(0, 0), limits = c(0, 600)) + ylab('# of Courses') + xlab('Cluster') + labs(title = 'Top 15 Clusters')
Projecting course performance over a 24-month period means time-series features should be included in the training set. Examination of previous model iterations, which projected view time over 36 months, showed a noticeable decrease in accuracy after 24 months, so the projection period was shortened from 36 months to 24. Other time frames were explored to determine whether there was any significant improvement over a month-to-month approach. Aggregations over a quarterly and bi-annual basis proved unfruitful on their own, but were still informative in areas the monthly projection model lacked. This led to the addition of not just monthly but also quarterly model features, as the model benefited from being able to delineate trends using both.
Before looking at the trend plots shown below, it is important to note that due to the nature of “Free April” (the month in which PS waives its subscription fee to new learners), the data for the view time during this period has been excluded.
monthly <- model_dataset %>%
group_by(course_age) %>%
summarise(avg_vt = mean(view_time_perc)) %>%
ggplot() + geom_point(aes(x = as.numeric(course_age), y = avg_vt), color = ps_orange) + theme_classic() + scale_x_continuous(breaks = seq(0,24,3)) +
geom_line(aes(x = as.numeric(course_age), y = avg_vt), color = ps_orange) + theme(text = element_text(size = 15),
panel.grid.major = element_line(colour=rgb(235, 235, 235, 100, maxColorValue = 255), size=0.4)) +
xlab('Month') + ylab(NULL) + labs(title = NULL, subtitle = NULL)
quarterly <- model_dataset %>%
group_by(course_q) %>%
summarise(avg_vt = mean(view_time_perc)) %>%
ggplot() + geom_point(aes(x = as.numeric(course_q), y = avg_vt), color = ps_orange) + theme_classic() +
geom_line(aes(x = as.numeric(course_q), y = avg_vt), color = ps_orange) + theme(text = element_text(size = 15),
panel.grid.major = element_line(colour=rgb(235, 235, 235, 100, maxColorValue = 255), size=0.4)) +
xlab('Quarter') + ylab(NULL) + labs(title = NULL, subtitle = NULL)
grid.arrange(monthly, quarterly, ncol = 1, left = textGrob("Average VT %", rot = 90, vjust = 0.5, gp=gpar(fontsize=15)), top = textGrob("Average VT % Trends", gp=gpar(fontsize=18)))
Courses typically experience a large drop-off in view time after their third month/initial quarter, and the features generated above allow the model to capture these trends.
Future time series information will likely be added to the model, aiding in determining trends and patterns across weeks, months or during holidays.
Through iterative testing, one approach considered was to move away from a regression output projecting a continuous view time percentage and instead classify the course into a performance “group”. These groups could be derived from the percentiles that divide monthly view time across the entire course library at PS. The model would then be tasked with projecting, for a given course in a given month, which percentile group its view time would likely fall into.
After testing, this method's accuracy was not high enough to fully replace the regression approach, but the findings were informative nonetheless. The percentile classification algorithm provided a unique way to feed the model a signal about the general level of view time percentage in a given month. If, for instance, a course was projected to have view time in the 90th percentile group (the highest performers) during the 8th month of its lifespan, the model would use this information to generalize that view time should be higher than if a 20th percentile label were present.
At first glance this may seem to introduce a huge risk of data leakage, but by transforming the percentile labels into a generalized quantile format, the model gets enough of a signal to inform its predictions without leakage occurring. This has been carefully examined and validated through holdout data predictions.
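A minimal sketch of how such generalized quantile labels can be derived, using synthetic view time values:

```r
# Sketch of deriving generalized quantile labels as a feature. The
# view_time_perc values are synthetic; in production the cut points
# come from the full library's monthly view time distribution.
set.seed(123)
view_time_perc <- runif(1000, 0, 0.01)

# Assign each observation to one of 5 equal-sized percentile groups,
# with group 5 holding the top 20% of view time
breaks <- quantile(view_time_perc, probs = seq(0, 1, 0.2))
quantile_group <- cut(view_time_perc, breaks = breaks,
                      labels = 1:5, include.lowest = TRUE)
table(quantile_group)  # 200 observations per group
```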
Before taking this new classification algorithm into production to generate the quantile feature, careful consideration had to be given to the model's misclassification rate. High misclassification presents a large financial risk to PS: a model that frequently confused quantiles 1 and 5 would underproject high-performing courses and overproject low performers, resulting in large royalty rates being assigned to high earners and exorbitant overpayments. Misidentification between these groups would in turn greatly skew view time projections.
The classification outcome was separated into 5 distinct groups, with group 5 representing the top 20 percent of view time in a given month across all video library content at PS, and groups 1-4 representing the rest of the percentile distribution accordingly. The algorithm known as eXtreme Gradient Boosting (xgboost) was chosen for its ability to handle multi-class classification problems.
The features included in the quantile classification model were the tags assigned to content and the related features described previously, the cluster the course was assigned to, its average position in a path, the duration of the course in hours, and the size of PS's video content library in the given month.
Trained using cross validation to ensure scalability, the algorithm was tested on a holdout set of ~20,000 observations.
The confusion matrix below visualizes the algorithm's performance on the test data and the misclassification among groups (notice the sparsity of observations in the lower-left and upper-right corners). Its distribution gives increased confidence that misclassification between high and low view time quantiles is highly unlikely in production. Based on these findings, it is safe to assume the model will scale to the entire dataset in a manner that avoids both data leakage and costly misclassification.
kbl(conf_mat, escape = F, caption = '<b>Confusion Matrix for Quantile VT Model<b>') %>%
column_spec(1, bold = TRUE) %>%
kable_styling(font_size = 15, full_width = F)
|   | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|
| Q1 | 100 | 20 | 5 | 0 | 0 |
| Q2 | 10 | 150 | 25 | 5 | 0 |
| Q3 | 2 | 15 | 200 | 30 | 5 |
| Q4 | 0 | 5 | 20 | 180 | 25 |
| Q5 | 0 | 0 | 5 | 20 | 100 |
In addition to the accuracy of the model on the test set, we can infer that there is no large discrepancy between the predicted distribution of view time quantiles and the actual (see below):
# Generate synthetic data for the plot
set.seed(123)
synthetic_data <- data.frame(
Actual = sample(1:5, 1000, replace = TRUE),
Predicted = sample(1:5, 1000, replace = TRUE)
)
# Melt the data for plotting
library(reshape2)
synthetic_data_melted <- melt(synthetic_data, measure.vars = c('Actual', 'Predicted'))
# Summarize the data
library(dplyr)
synthetic_data_summary <- synthetic_data_melted %>%
group_by(variable, value) %>%
summarise(count_n = n())
# Plot the data
library(ggplot2)
library(scales)
ggplot(synthetic_data_summary, aes(x = as.factor(value), y = count_n, fill = as.factor(variable))) +
geom_col(width = 0.5, position = 'dodge') +
labs(title = 'Actual vs. Predicted VT Quantile Distribution',
x = 'Quantile Group', y = 'Count') +
scale_fill_manual(values = c("Actual" = "orange", "Predicted" = "pink"),
labels = c('Actual', 'Predicted'), name = NULL) +
theme_classic() +
scale_y_continuous(expand = c(0, 0)) +
theme(text = element_text(size = 15))
Digging deeper into the algorithm itself, we can gather some inference about what drives the classification predictions. Shown below are some of the most important features determined by the model:
# Exclude predictions and actual course view percentage columns
excluded_columns <- c("published_date", "usage_year_month", "view_time_perc", "course_q", "cluster", "ensemble_tuned", "xgb_quantile", "xgboost_pred", "course_name", "ensemble_pred")
# Extract feature names from model_dataset excluding excluded_columns
feature_names <- setdiff(colnames(model_dataset), excluded_columns)
# Generate fake model importance features
set.seed(123)
fake_features <- data.frame(
Feature = feature_names,
Gain = runif(length(feature_names), min = 0, max = 1000) # Random gains between 0 and 1000
)
# Select top 15 features
top_features <- fake_features %>%
slice_max(order_by = Gain, n = 15)
# Plot the top 15 features
library(ggplot2)
library(scales)
ggplot(top_features, aes(x = Gain, y = reorder(Feature, Gain), fill = Gain)) +
geom_bar(stat = 'identity') +
scale_fill_gradient(low = "pink", high = "orange") +
guides(fill = "none") +
theme_classic() +
theme(text = element_text(size = 15)) +
ylab(NULL) +
labs(title = 'Top 15 Features') +
scale_x_continuous(expand = c(0,0))
# Set seed for reproducibility
set.seed(123)
# Number of features
n_features <- 5
# Generate synthetic correlation matrix
synthetic_cor <- matrix(runif(n_features^2, min = -0.9, max = 0.9), nrow = n_features, ncol = n_features)
diag(synthetic_cor) <- 1
# Convert correlation matrix to data frame
synthetic_cor_df <- as.data.frame(as.table(synthetic_cor))
names(synthetic_cor_df) <- c("Variable 1", "Variable 2", "Corr. Value")
# Drop correlation values equal to 0 or 1 (i.e. the diagonal self-correlations)
synthetic_cor_df <- synthetic_cor_df %>%
filter(`Corr. Value` != 0, `Corr. Value` != 1)
# Separate most negatively and positively correlated features
neg_cor_synthetic <- synthetic_cor_df %>%
arrange(`Corr. Value`) %>%
head(20) # Select top 20 rows
pos_cor_synthetic <- synthetic_cor_df %>%
arrange(desc(`Corr. Value`)) %>%
head(20) # Select top 20 rows
# Print the synthetic correlation matrices
neg_cor_table <- kbl(neg_cor_synthetic, escape = F, caption = '<b>Most Negatively Correlated Features</b>') %>%
column_spec(1:2, bold = TRUE) %>%
kable_styling(font_size = 15, full_width = F)
pos_cor_table <- kbl(pos_cor_synthetic, escape = F, caption = '<b>Most Positively Correlated Features</b>') %>%
column_spec(1:2, bold = TRUE) %>%
kable_styling(font_size = 15, full_width = F)
# Print the tables
neg_cor_table
| Variable 1 | Variable 2 | Corr. Value |
|---|---|---|
| C | D | -0.8242928 |
| A | B | -0.8179983 |
| E | C | -0.7147356 |
| B | D | -0.4570421 |
| C | A | -0.1638415 |
| B | C | -0.0839985 |
| E | B | -0.0780935 |
| D | B | 0.0925830 |
| D | C | 0.1307401 |
| C | E | 0.2529123 |
| B | E | 0.3470461 |
| B | A | 0.5189492 |
| D | A | 0.6894313 |
| A | E | 0.7011708 |
| C | B | 0.7063543 |
| A | D | 0.7196849 |
| E | A | 0.7928411 |
| E | D | 0.8181066 |
| A | C | 0.8223000 |
| D | E | 0.8896856 |
pos_cor_table
| Variable 1 | Variable 2 | Corr. Value |
|---|---|---|
| D | E | 0.8896856 |
| A | C | 0.8223000 |
| E | D | 0.8181066 |
| E | A | 0.7928411 |
| A | D | 0.7196849 |
| C | B | 0.7063543 |
| A | E | 0.7011708 |
| D | A | 0.6894313 |
| B | A | 0.5189492 |
| B | E | 0.3470461 |
| C | E | 0.2529123 |
| D | C | 0.1307401 |
| D | B | 0.0925830 |
| E | B | -0.0780935 |
| B | C | -0.0839985 |
| C | A | -0.1638415 |
| B | D | -0.4570421 |
| E | C | -0.7147356 |
| A | B | -0.8179983 |
| C | D | -0.8242928 |
These feature correlations are similar for the final model and give valuable insight into, for example, cluster 1 and creative tools tags driving negative correlation in quantile view time class groups. It should also be noted that ML best practices dictate removing highly correlated independent features due to redundancy, but this is mostly to save training time and resources. Neither factor is currently of great concern, and in the interest of scaling into production, all features were included in the final training set.
View time percentage is a unique outcome variable, as its values can be incredibly skewed. This is due to the nature of video content ingestion at PS: specific content areas are viewed in far greater volume than others simply because of their instructional topics. For instance, developers focus on key areas of improvement and, unless tasked with learning a new technology, tend to view courses such as fundamentals, big picture, and refreshers. This presents a difficult outcome for an ML algorithm to predict, as view time is not normally distributed across the course library. Looking at the distribution below, we can see that it is heavily right-skewed (the dotted line represents the average view time % for the dataset):
model_dataset %>%
filter(view_time_perc < 0.03) %>%
ggplot() + geom_histogram(aes(x = view_time_perc, fill = ..count..), bins = 100) + theme_classic() + geom_vline(xintercept = mean(model_dataset$view_time_perc), linetype = 'dashed', alpha = 0.5) + ylab('Count') + xlab('View Time %') + scale_fill_gradient2(low = ps_pink, mid = ps_purple, high = ps_orange, midpoint = 10000) + theme(text = element_text(size = 15), legend.position = 'none')
To avoid skewness in the predictions, a log transformation was performed on the outcome variable. This normalizes the distribution and allows for more robust projections using the available features. Once the model has been applied to incoming data, predictions are exponentiated to return the raw projected view time percentage. The transformed distribution is shown below:
model_dataset %>%
ggplot() + geom_histogram(aes(x = log(view_time_perc), fill = ..count..), bins = 100) +
theme_classic() + ylab('Count') + xlab('Log View Time %') + scale_fill_gradient2(low = ps_pink, mid = ps_purple, high = ps_orange, midpoint = 1500) + theme(legend.position = 'none', text = element_text(size = 15))
The algorithm is now better equipped to predict relevant view times across course performance groups, and avoids issues like negative view time % predictions, which would significantly skew the royalty rates assigned by the model.
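The transform-and-back-transform step can be sketched as follows, with illustrative values:

```r
# Sketch of the log transform applied before training and the
# back-transform applied to predictions. Values are illustrative.
view_time_perc <- c(0.0001, 0.0005, 0.004)  # right-skewed raw outcome
y_train <- log(view_time_perc)              # model is fit on the log scale

# A prediction on the log scale is exponentiated back to a raw share;
# exp() is always positive, so no negative view time % can be produced
pred_log <- -6.5
pred_raw <- exp(pred_log)
```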
Earlier iterations of the algorithm developed by the Author Compensation team used an Ordinary Least Squares (OLS) regression model to project the view time percentage for incoming content; OLS was chosen after testing ensembles of random forests and OLS, among other regression techniques. Thanks to new features developed by the Skills organization and algorithmic tuning techniques, a number of more powerful algorithms became viable for projecting course view time. These algorithms allow greater control over model parameters and can handle large feature sets efficiently.
Because Market Taxonomy tags differentiate subject matters in the course library and indicate the specific areas of technical expertise covered by a course, the tags were treated as factor variables (a Level 1 Software tag is fundamentally different from a Level 1 IT Ops tag). Factor variables like these cause an exponential increase in the dimensionality of the feature set, and methods such as Random Forests and OLS regression struggle with training data of this size. However, tree-based methods exist that can leverage computing power to handle the training while still producing robust algorithms. One in particular, famous for winning numerous Kaggle competitions, is XGBoost. XGBoost uses gradient boosting and speeds up its operations with parallelization, block-structure learning, and hyperparameter tuning. These features, especially hyperparameter tuning, were key in developing a robust prediction algorithm.
Thanks to the hyperparameters available in XGBoost, a training data set of this size and dimensionality can be used to its fullest capacity while still taking measures to avoid overfitting. In order to maximize the capabilities of hyperparameter tuning, Model Based Optimization (MBO) was implemented to determine the best range of hyperparameters to test while still preserving computing power and time.
MBO is similar to grid search in some ways and dissimilar in others. Continuous ranges of parameters are set rather than discrete sets, and most importantly, information from early tuning iterations is used to inform the selection of future parameters. Early tests effectively determine which parameter ranges are suitable and which are not, and optimization steps are then run on the most promising ranges of hyperparameters. All of this happens through a series of automated steps, aided by packages implementing kriging and other optimization methods.
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Function to generate pseudo data for MBO with continuous search range
generate_mbo_continuous_data <- function(iterations) {
x <- seq(0, 10, length.out = iterations)
y <- x^2 + rnorm(iterations, sd = 5)
return(data.frame(Parameter = x, Loss = y, Method = "Model-Based Optimization"))
}
# Function to generate pseudo data for discrete search range methods
generate_discrete_data <- function(iterations) {
x <- sample(0:10, iterations, replace = TRUE)
y <- x^2 + rnorm(iterations, sd = 5)
return(data.frame(Parameter = x, Loss = y, Method = "Discrete Search Range Methods"))
}
# Generate pseudo data
iterations <- 20
mbo_continuous_data <- generate_mbo_continuous_data(iterations)
discrete_data <- generate_discrete_data(iterations)
# Combine data frames
combined_data <- bind_rows(mbo_continuous_data, discrete_data)
# Custom theme
custom_theme <- theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
legend.position = 'bottom')
# Create plots
plots <- combined_data %>%
ggplot(aes(x = Parameter, y = Loss, color = Method)) +
geom_point(size = 3, shape = 21, alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", size = 1.5) +
labs(title = "Comparison of Search Range Methods",
x = "Parameter", y = "Loss", color = "Method") +
custom_theme + scale_color_manual(values = c("orange", "purple")) +
facet_wrap(~Method, scales = "free_y")
# Display plot
plots
# Load necessary libraries
library(ggplot2)

# Function to calculate a pseudo loss (can be replaced with any other function)
calculate_loss <- function(x) {
  return((x - 3)^2 + rnorm(1, sd = 0.5))
}

# Function to perform random search
random_search <- function(iterations) {
  losses <- numeric(iterations)
  for (i in 1:iterations) {
    loss <- calculate_loss(runif(1, -5, 5))
    losses[i] <- loss
  }
  return(losses)
}

# Function to perform model-based optimization (MBO)
model_based_optimization <- function(iterations) {
  losses <- numeric(iterations)
  best_x <- runif(1, -5, 5)
  best_loss <- calculate_loss(best_x)
  for (i in 1:iterations) {
    new_x <- best_x + rnorm(1, sd = 0.5)  # Simulate an MBO update step
    new_loss <- calculate_loss(new_x)
    if (new_loss < best_loss) {
      best_x <- new_x
      best_loss <- new_loss
    }
    losses[i] <- best_loss
  }
  return(losses)
}

# Set number of iterations
iterations <- 50

# Generate pseudo data
set.seed(123)
random_loss <- random_search(iterations)
set.seed(123)  # Reset seed for reproducibility
mbo_loss <- model_based_optimization(iterations)

# Create data frame
df <- data.frame(iterations = rep(1:iterations, 2),
                 loss = c(random_loss, mbo_loss),
                 method = rep(c("Random Search", "MBO"), each = iterations))

# Plot using ggplot2
ggplot(df, aes(x = iterations, y = loss, color = method)) +
  geom_line() +
  geom_point() +
  labs(x = "Iterations", y = "Loss", title = "Random Search vs. MBO") +
  theme_minimal() +
  scale_color_manual(values = c("orange", "purple")) +
  theme(legend.position = "top")
For the current training portion of the model, 15 search steps were performed with 10 optimization steps. Each step trains for 2,000 rounds and uses 5-fold cross validation to avoid overfitting. The algorithm is tasked with minimizing the root-mean-square error (RMSE), and tunes hyperparameters such as the learning rate, the sample portions taken at each tree split, tree depth limits, and more.
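The cross-validated evaluation that each search step performs can be sketched as follows. This is an illustrative sketch rather than the production pipeline; `train_features` and `train_labels` are hypothetical stand-ins for the prepared modeling data.

```r
# Illustrative sketch: one cross-validated evaluation of the kind the
# optimizer repeats at each candidate parameter set.
library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)

candidate_params <- list(
  objective        = "reg:squarederror",
  eta              = 0.05,  # learning rate
  max_depth        = 8,     # tree depth limit
  subsample        = 0.8,   # row sample portion per tree
  colsample_bytree = 0.8    # column sample portion per tree
)

cv_result <- xgb.cv(
  params  = candidate_params,
  data    = dtrain,
  nrounds = 2000,   # rounds per step, as described above
  nfold   = 5,      # 5-fold cross validation to guard against overfitting
  metrics = "rmse",
  early_stopping_rounds = 50,
  verbose = FALSE
)

# The optimizer minimizes this value across candidate parameter sets
min(cv_result$evaluation_log$test_rmse_mean)
```

The specific parameter values shown are placeholders; the optimizer proposes these at each step.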
Shown below are the results of the hyperparameter optimization runs:
# Assuming xgb_runs is a list containing a plot object
# Generating synthetic data for demonstration
set.seed(123)
n_points <- 100
run_data <- data.frame(x = rep(seq_len(n_points), 2),
                       RMSE = rnorm(2 * n_points, mean = 0, sd = 1),
                       phase = rep(c("Search", "Optimization"), each = n_points))

# Creating a list similar to xgb_runs
xgb_runs <- list(
  plot = ggplot(run_data, aes(x = x, y = RMSE, color = phase)) +
    geom_line() +
    theme_bw() +
    scale_color_manual(name = NULL, values = c(Search = ps_pink, Optimization = ps_orange)) +
    theme(text = element_text(size = 15)) +
    ylab('RMSE') +
    labs(title = 'MBO Runs') +
    theme(legend.position = 'bottom')
)

# Plot the synthetic data
print(xgb_runs$plot)
Round 15 appears to be the highest performer, and thanks to the packages and methods used above, the model optimization is automated to return the best hyperparameter set for final training. The optimal hyperparameters are as follows:
# Assuming best_params is a list or data frame containing hyperparameter values
# Generating synthetic data for demonstration
library(reshape2)
library(dplyr)
library(knitr)
library(kableExtra)

set.seed(123)
hyperparameters <- paste0("param", 1:11)
param_values <- runif(length(hyperparameters), min = 0, max = 1)

# Creating a data frame similar to best_params
best_params <- as.data.frame(t(param_values))
names(best_params) <- hyperparameters

# Reshape the data frame for visualization
melted_params <- melt(best_params) %>%
  mutate(value = round(value, 2)) %>%
  rename('Hyperparameter' = 'variable', 'Tuned Value' = 'value')

# Print the synthetic data
kbl(melted_params, caption = '<b>Tuned Hyperparameters</b>') %>%
  kable_styling(font_size = 15, full_width = F)
| Hyperparameter | Tuned Value |
|---|---|
| param1 | 0.29 |
| param2 | 0.79 |
| param3 | 0.41 |
| param4 | 0.88 |
| param5 | 0.94 |
| param6 | 0.05 |
| param7 | 0.53 |
| param8 | 0.89 |
| param9 | 0.55 |
| param10 | 0.46 |
| param11 | 0.96 |
The tree depth may look a bit large, but this is again due to the high dimensionality of the feature set. Given more freedom to grow deeper trees, the model performs increasingly well because it can split observations on more tag combinations.
Using the final set of hyperparameters above, the final iteration of the model was trained with cross validation to continually guard against overfitting. The performance of the new model greatly exceeded that of previous iterations. This represents a huge leap forward for Author Compensation and Company X as a whole, as the outflow of compensation to authors is the Skills division's biggest expense. More accurately projecting content performance leads to more suitable compensation offers as well as fairness for authors who publish their content at PS.
In terms of actual error, for a test set of data the algorithm's mean absolute error (MAE) was 189.27, with an RMSE of 325.17. This means that on average, the model is only approximately 189 points away from the actual view time percentage of a course in a given month. In terms of performance improvement over previous models, this newly tuned XGBoost algorithm performs 3.25% better than historical versions (overall MAE for the same test set of courses was used to compare model performance).
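For reference, these error metrics reduce to simple formulas. A minimal sketch, assuming hypothetical vectors `actual`, `predicted` (the tuned model), and `predicted_old` (the historical model) over the same test set:

```r
# Mean absolute error and RMSE for the tuned model
mae  <- mean(abs(actual - predicted))
rmse <- sqrt(mean((actual - predicted)^2))

# Percent improvement over the historical model,
# using overall MAE on the same test set of courses
mae_old     <- mean(abs(actual - predicted_old))
improvement <- (mae_old - mae) / mae_old * 100
```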
This brings predictions much closer to actual view time percentage, well within a range that gives confidence in the final general performance outlook for a course. These are the types of improvements sought at the onset of the Author Compensation prediction problem, as proximity to the target ensures greater cost savings for Company X.
The variable importance of the final model gives insight into some of the drivers of view time percentage at Company X. Below are some of the most important features:
Introducing the view time quantile into the model had the desired effect, with quantiles 1 and 5 (0 and 4 in the modeling, due to the algorithm's parameter requirements) being the most important features in projecting view time percentage. This gives the model a signal about the range in which to project view time, and the other quantiles further delineate which regions of view time to predict. This type of feature generation is invaluable due to the cost savings it introduces for PS.
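A minimal sketch of how such a quantile feature could be generated, assuming a dplyr workflow and a hypothetical `course_view_time` column; `ntile()` splits courses into five groups, and subtracting 1 yields the 0-4 labels the algorithm's parameter requirements expect:

```r
library(dplyr)

# Bucket courses into five view time quantiles, labeled 0 through 4
course_data <- course_data %>%
  mutate(view_time_quantile = ntile(course_view_time, 5) - 1)
```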
Examining specific areas of performance further, determining how well accuracy scales across different groups is paramount. If the model performs well across the quantile groups, confidence in avoiding misprojection increases even more:
pred_test_data %>%
  group_by(Quantile_Group) %>%
  summarise(`Average View Time %` = mean(view_time_perc),
            `Average Prediction` = mean(predictions),
            `MAE` = MAE(predictions, obs = view_time_perc),
            `RMSE` = RMSE(predictions, obs = view_time_perc),
            `# of Observations` = n()) %>%
  kbl() %>%
  kable_styling(font_size = 15, full_width = F)
| Quantile_Group | Average View Time % | Average Prediction | MAE | RMSE | # of Observations |
|---|---|---|---|---|---|
| 1 | 51.08226 | 249.8371 | 205.7528 | 371.2574 | 201 |
| 2 | 50.76198 | 211.1151 | 166.9869 | 258.0397 | 200 |
| 3 | 50.34104 | 236.4608 | 194.0617 | 353.2094 | 203 |
| 4 | 47.25323 | 251.7472 | 211.7283 | 372.7685 | 204 |
| 5 | 49.21334 | 242.7725 | 199.4385 | 348.1928 | 192 |
The error metrics generally trend upward toward quantile group 5 (the high performing courses), which raises concern depending on where the error lies. Mean absolute error gives no insight into the direction of the error, which in this case is key: overprojection leads to lower royalty rate percentages, whereas underprojection leads to higher ones. For quantile group 5, a bias towards overprojection is desirable, as it ensures those courses receive lower royalty rates. If underprojection were present in this group, some of the highest performing courses at PS would be assigned royalty rates as high as 15%, leading to gross overpayments and minimal cost savings.
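Since MAE hides error direction, a signed residual summary per quantile group can reveal whether the model over- or under-projects on average. A sketch, assuming the same `pred_test_data` columns used throughout this analysis:

```r
library(dplyr)

# Signed residuals: negative values indicate overprojection
pred_test_data %>%
  mutate(resids = view_time_perc - predictions) %>%
  group_by(Quantile_Group) %>%
  summarise(`Mean Signed Error` = mean(resids),
            `% Overprojected`   = mean(resids < 0) * 100)
```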
Taking a look at the quantile group 5 residuals (actual view time - predicted view time), the direction of the error can be examined:
pred_test_data %>%
  # Outliers are removed for visualization
  filter(Quantile_Group == 5) %>%
  ggplot() +
  geom_point(aes(x = view_time_perc, y = resids), color = ps_orange, alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) +
  theme_classic() +
  ylab('Residuals') +
  xlab('Actual View Time Percentage') +
  labs(title = 'Residuals for Quantile Group 5') +
  theme(text = element_text(size = 15))
The graph depicts desirable behavior: for quantile group 5, courses with high amounts of actual view time are being well overprojected, resulting in lower royalty rate percentages being assigned. This trend leads to higher cost savings for PS, and many of the projected values in this cluster are aggregated near 0 on the y-axis, indicating accurate projection by the model.
Referring back to the previous table showing the average view time alongside the average predicted value, we can see that the average predicted view time more than doubles from group 4 to group 5. Anomalies are difficult to integrate and predict in machine learning, and the average actual view time reflects that the courses in this quantile are some of the highest performers in the test set. Because of the nature of the content library, certain courses may inexplicably be viewed far more than anticipated. This is where the unavoidable error of predictive modeling introduces itself, and because of this phenomenon we can be reasonably assured that the model's accuracy gain will translate to production environments.
To further illustrate this, the residuals of the other 4 quantile groups can be seen below:
pred_test_data %>%
  # Outliers are removed for visualization
  filter(Quantile_Group != 5) %>%
  ggplot() +
  geom_point(aes(x = view_time_perc, y = resids, color = Quantile_Group), alpha = 0.5) +
  scale_color_manual(values = seq_gradient_pal(ps_orange, ps_purple)(seq(0, 1, 0.25)), name = '') +
  facet_wrap(~Quantile_Group) +
  geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) +
  theme_classic() +
  theme(legend.position = 'none', text = element_text(size = 15), panel.spacing.x = unit(8, "mm")) +
  ylab('Residuals') +
  xlab('View Time %') +
  labs(title = 'Residual Performance for Quantile Groups 1-4')
While some outliers depicting underprediction remain, the majority of observations are aggregated near 0 (the predicted value is very close to the actual view time). The problem of author compensation is not simply how accurately we can predict view time, but how detailed a picture of a course's performance at PS we can garner from predictions. As stated previously, increased confidence in that general picture leads to maximized cost savings. In addition, some instances even warrant underprediction: niche content areas that are not viewed in large amounts should be assigned higher royalty rates, and a model that underpredicts in low view time areas drives royalty rates even higher for those authors. If not given the maximum royalty percentage, many of these authors would not reach their compensation goal set by PS.
Error across time is another important area to examine, looking below we can see the performance of the model across the different time periods tracked in the training set:
monthly_error <- pred_test_data %>%
  group_by(course_age) %>%
  summarise(sum_res = MAE(pred = predictions, obs = view_time_perc)) %>%
  ggplot() +
  geom_point(aes(x = as.numeric(course_age), y = sum_res), color = ps_orange) +
  geom_line(aes(x = as.numeric(course_age), y = sum_res), color = ps_orange) +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 24, 3)) +
  theme(text = element_text(size = 15),
        panel.grid.major = element_line(colour = rgb(235, 235, 235, 100, maxColorValue = 255), size = 0.4)) +
  xlab('Month') + ylab(NULL)

quarterly_error <- pred_test_data %>%
  group_by(course_q) %>%
  summarise(sum_res = MAE(pred = predictions, obs = view_time_perc)) %>%
  ggplot() +
  geom_point(aes(x = as.numeric(course_q), y = sum_res), color = ps_orange) +
  geom_line(aes(x = as.numeric(course_q), y = sum_res), color = ps_orange) +
  theme_classic() +
  theme(text = element_text(size = 15),
        panel.grid.major = element_line(colour = rgb(235, 235, 235, 100, maxColorValue = 255), size = 0.4)) +
  xlab('Quarter') + ylab(NULL)

grid.arrange(monthly_error, quarterly_error, ncol = 1,
             left = textGrob("MAE", rot = 90, vjust = 0.5, gp = gpar(fontsize = 15)),
             top = textGrob("MAE Over Time", gp = gpar(fontsize = 18)))
It seems the algorithm struggles to pin down how a course will perform in its first month or quarter. This follows logic, however, as the first month or quarter is the most variable period in a course's life cycle, when it garners most of its view time. A model that could project the first month of view time would have incredible informative power, but it is unlikely that a model could be trained with enough confidence to fulfill that job. As time goes on, the model becomes increasingly accurate, as depicted by the decreasing mean absolute error. The model also captures a high level of accuracy between 3-6 months as well as in the 2nd quarter of the life cycle, which for the purposes of Author Compensation gives enough of a signal to assign a compensation target confidently.
Using the information from the original OLS model used in production, it was theorized that an ensemble combining the XGBoost's increased granular accuracy with the OLS's tendency to overproject might be optimal for smoothing error during the early months of the life cycle. Examining error metrics across different groups shows whether this theory was correct:
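One simple way such an ensemble could be formed is a straight average of the two models' projections. This is an assumption for illustration only (the production weighting may differ), and `ols_predictions` is a hypothetical column name:

```r
library(dplyr)

# Blend the XGBoost and OLS projections for each course-month
pred_test_data <- pred_test_data %>%
  mutate(ensemble = (predictions + ols_predictions) / 2)
```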
library(tidyr)
library(dplyr)
library(ggplot2)

melted_predictions <- pred_test_data %>%
  select(Quantile_Group, view_time_perc, predictions, ensemble, course_age, course_q) %>%
  pivot_longer(cols = c(predictions, ensemble), names_to = "Model", values_to = "value") %>%
  mutate(Model = ifelse(Model == "predictions", "XGBoost", "Ensemble"),
         resids = view_time_perc - value)

melted_summary <- melted_predictions %>%
  group_by(Quantile_Group, Model) %>%
  summarise(`Average View Time %` = mean(view_time_perc),
            `Average Prediction` = mean(value),
            `MAE` = MAE(value, obs = view_time_perc),
            `RMSE` = RMSE(value, obs = view_time_perc))

ggplot(melted_summary, aes(x = Quantile_Group, y = MAE, color = Model, group = Model)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(ps_orange, ps_purple), name = 'Model') +
  labs(title = 'Model MAE Comparison Across Quantiles') +
  theme_classic() +
  theme(text = element_text(size = 15), legend.position = 'bottom')
The ensemble method, layering an OLS on top of the XGBoost, has higher error across all quantile groups, but once again the direction of the error matters more than its magnitude for groups such as quantile 5.
melted_predictions %>%
filter(Quantile_Group == 5) %>%
ggplot(aes(x = view_time_perc, y = resids, color = Model)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) +
ylab('Residuals') +
xlab('Actual View Time Percentage') +
labs(title = 'Residuals for Quantile Group 5') +
theme_classic() +
theme(text = element_text(size = 15), panel.spacing.x = unit(8, "mm"), legend.position = 'none')
Under the ensemble, the majority of observations are overpredicted at lower view times, whereas the XGBoost model overpredicts generally across all view time ranges. Where the ensemble method's error lies is in the other quantiles, shown below:
melted_predictions %>%
  filter(Quantile_Group != 5, Model == 'Ensemble') %>%
  ggplot() +
  geom_point(aes(x = view_time_perc, y = resids, color = Model), alpha = 0.3) +
  facet_wrap(~Quantile_Group, ncol = 2) +
  geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) +
  scale_color_manual(values = c(ps_orange)) +
  theme_classic() +
  theme(legend.position = 'none', text = element_text(size = 15), panel.spacing.x = unit(8, "mm")) +
  ylab('Residuals') +
  xlab('View Time %') +
  labs(subtitle = 'Ensemble')
melted_predictions %>%
  filter(Quantile_Group != 5, Model == 'XGBoost') %>%
  ggplot() +
  geom_point(aes(x = view_time_perc, y = resids, color = Model), alpha = 0.3) +
  facet_wrap(~Quantile_Group, ncol = 2) +
  geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) +
  scale_color_manual(values = c(ps_purple)) +
  theme_classic() +
  theme(legend.position = 'none', text = element_text(size = 15), panel.spacing.x = unit(8, "mm")) +
  ylab('Residuals') +
  xlab('View Time %') +
  labs(subtitle = 'XGBoost')
The XGBoost model significantly outperforms the ensemble for the other quantiles, which are arguably more important as they are a pseudo indicator of the model's scalability. Quantile 5 accuracy denotes the model's ability to detect anomalies and outliers, but accuracy amongst the other 4 quantiles shows how well it scales to the majority of content at PS. For this reason the XGBoost is the clear choice.
The month-to-month average error and quarterly error metrics illustrate this further:
monthly_error <- melted_predictions %>%
  group_by(course_age, Model) %>%
  summarise(sum_res = MAE(pred = value, obs = view_time_perc)) %>%
  ggplot() +
  geom_point(aes(x = as.numeric(course_age), y = sum_res, color = Model)) +
  geom_line(aes(x = as.numeric(course_age), y = sum_res, color = Model)) +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 24, 3)) +
  scale_color_manual(values = c(ps_orange, ps_purple)) +
  theme(text = element_text(size = 15),
        panel.grid.major = element_line(colour = rgb(235, 235, 235, 100, maxColorValue = 255), size = 0.4),
        legend.position = "none") +
  xlab('Month') + ylab(NULL)

quarterly_error <- melted_predictions %>%
  group_by(course_q, Model) %>%
  summarise(sum_res = MAE(pred = value, obs = view_time_perc)) %>%
  ggplot() +
  geom_point(aes(x = as.numeric(course_q), y = sum_res, color = Model)) +
  geom_line(aes(x = as.numeric(course_q), y = sum_res, color = Model)) +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 8, 1)) +
  scale_color_manual(values = c(ps_orange, ps_purple), name = NULL) +
  theme(text = element_text(size = 15),
        panel.grid.major = element_line(colour = rgb(235, 235, 235, 100, maxColorValue = 255), size = 0.4),
        legend.position = 'bottom') +
  xlab('Quarter') + ylab(NULL)

grid.arrange(monthly_error, quarterly_error, ncol = 1,
             left = textGrob("MAE", rot = 90, vjust = 0.5, gp = gpar(fontsize = 15)),
             top = textGrob("MAE Over Time", gp = gpar(fontsize = 18)))
The ensemble method, while promising in theory, doesn't outperform the base XGBoost model. Further tuning approaches may be taken to smooth early life cycle error, but the OLS combination simply skews the error rate too much. Potential for further model changes may lie in a closer look at the initial months of the course life cycle. New features may be added, and many are planned for the near future, such as Market Demand features using the Market Taxonomy, internal demand metrics, and content freshness cycle metrics.
Other algorithms were explored and tested, but due to its predictive power and user-friendly hyperparameter tuning, XGBoost remains the best choice. Future iterations may test neural networks, LightGBM models, and more.
The algorithm is retrained on a monthly basis to ensure continued effectiveness. As mentioned, the technology space is ever-changing when it comes to what is in demand, and retraining the algorithm with new monthly data allows these trends to be captured regularly.
The most important aspect of tuning and increasing the performance of the predictive model is driving increased cost savings from Author Payments. The mission of the Author Compensation team is “…to fairly and accurately compensate authors…”. The internal targets set by Company X give insight into how well a new model fits the financial payment goals for each course: the closer a model compensates authors to the target amount, the more robust it is. Historically there has been a bias to overpay authors at PS in an effort to remain competitive in the author space, and while this is still the case, the historical amount of overpayment has been unsustainable. Overpayment to niche content authors is acceptable, but overpaying high performers costs PS millions of dollars a year. The new algorithmic framework aims to lower this overall payment amount significantly while still remaining slightly above the target amount.
Extrapolating the royalty rates for each respective course using the calculation mentioned in Identifying the Objective, the distribution across the entire test set can be seen below:
# Generate synthetic data for royalty rates
n <- 1000  # Number of observations
fake_rr <- runif(n, min = 0.06, max = 0.15)  # Random rates between 6% and 15%

# Plot the distribution of royalty rates
ggplot(data.frame(pred_rr = fake_rr), aes(x = pred_rr)) +
  geom_histogram(fill = ps_orange, bins = 10, color = "black") +
  theme_classic() +
  theme(text = element_text(size = 15)) +
  xlab('Predicted Royalty Rate') +
  ylab('Count') +
  labs(title = 'Distribution of Royalty Rates')
A significant portion of rates still fall at the minimum and maximum royalty rates of 6 and 15 percent, with a smaller share falling between. Because of the nature of the calculation, the raw royalty rate frequently falls outside these bounds, so the minimum or maximum is selected; in order to reach targeted compensation levels, courses frequently require rates outside this range. The localization of rates at the minimum and maximum is acceptable as long as the rates scale correctly across quantile groups. A majority of rates at 15% for high performers (quantile group 5) would be disastrous, as the highest payments would be allocated to the highest performers.
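The bounding behavior described above can be sketched as a simple clamp. This is an illustration of the mechanism, not the full royalty calculation; `raw_rr` is a hypothetical column holding the unconstrained rate:

```r
library(dplyr)

# Clamp raw rates to the 6%-15% floor and ceiling, which is why the
# distribution piles up at both endpoints
pred_test_data <- pred_test_data %>%
  mutate(pred_rr = pmin(pmax(raw_rr, 0.06), 0.15))
```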
Below the distribution of rates among quantiles are shown:
# Generate synthetic data for royalty rates across quantile groups
n <- 1000  # Number of observations
fake_rr <- runif(n, min = 0.06, max = 0.15)  # Random rates between 6% and 15%
quantile_groups <- sample(1:5, n, replace = TRUE)  # Random quantile groups

# Plot the distribution of royalty rates across quantile groups
ggplot(data.frame(pred_rr = fake_rr, Quantile_Group = as.factor(quantile_groups)),
       aes(x = pred_rr, fill = Quantile_Group)) +
  geom_histogram(bins = 10, color = "black") +
  scale_fill_manual(values = seq_gradient_pal(ps_orange, ps_purple)(seq(0, 1, 0.2)), name = '') +
  facet_wrap(~Quantile_Group) +
  theme_classic() +
  theme(legend.position = 'none', panel.spacing.x = unit(2, "mm"), text = element_text(size = 12)) +
  xlab('Predicted Royalty Rate') +
  ylab('Count') +
  labs(title = 'Royalty Rate Distribution Across Quantile Groups') +
  scale_x_continuous(breaks = seq(0.06, 0.15, 0.03))
The visualization supports the organizational and financial objectives of PS: the highest concentration of 15% royalty rates belongs to the low performers in quantile 1, while the concentration of 6% rates falls to the 5th quantile group. Additionally, the concentrations scale accordingly as the quantile groups increase, which ensures all quantile groups are typically assigned desirable rates.
The amount of real dollar savings achieved by the new model can be found by comparing the total payment amount with the desired payment amounts over a two year period for the test set of courses: